{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
    "\n",
    "<i>Licensed under the MIT License.</i>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# User2Item recommendations with LightGCN\n",
    "This notebook shows how to run an ID-based collaborative filtering baseline with LightGCN. <br>\n",
    "LightGCN is a simple Graph Convolution Network (GCN) model for recommender systems. It uses a GCN to learn user and item embeddings so that both low-order and high-order user-item interactions are explicitly encoded in the embedding function.\n",
    "<img src=\"https://recodatasets.z20.web.core.windows.net/kdd2020/images%2FLightGCN-graphexample.JPG\" width=\"600\">\n",
    "\n",
    "\n",
    "\n",
    "The model architecture is illustrated as follows:\n",
    "<img src=\"https://recodatasets.z20.web.core.windows.net/images/lightGCN-model.jpg\" width=\"600\">\n",
    "\n",
    "For more details and instructions, please refer to [lightgcn_deep_dive.ipynb](../../02_model_collaborative_filtering/lightgcn_deep_dive.ipynb)."
   ]
  },
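  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The propagation idea described above can be sketched in a few lines of NumPy. This is only an illustration on a hypothetical toy interaction matrix, not the `reco_utils` implementation: following the LightGCN paper, embeddings are repeatedly multiplied by the symmetrically normalized user-item adjacency matrix, and the per-layer outputs are averaged. The names `R`, `E0`, and `lightgcn_propagate` are assumptions introduced for this sketch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def lightgcn_propagate(R, E0, n_layers=3):\n",
    "    # Toy sketch: average E0 with its n_layers propagated versions.\n",
    "    n_users, n_items = R.shape\n",
    "    n_nodes = n_users + n_items\n",
    "    # Bipartite adjacency matrix: users first, then items.\n",
    "    A = np.zeros((n_nodes, n_nodes))\n",
    "    A[:n_users, n_users:] = R\n",
    "    A[n_users:, :n_users] = R.T\n",
    "    # Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes keep zero rows.\n",
    "    deg = A.sum(axis=1)\n",
    "    d_inv_sqrt = np.zeros_like(deg)\n",
    "    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5\n",
    "    A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]\n",
    "    layers = [E0]\n",
    "    for _ in range(n_layers):\n",
    "        layers.append(A_hat @ layers[-1])\n",
    "    # Uniform layer combination (equal weight for every layer).\n",
    "    return np.mean(layers, axis=0)\n",
    "\n",
    "# Hypothetical toy data: 2 users, 3 items, 4-dimensional embeddings.\n",
    "R = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])\n",
    "rng = np.random.RandomState(0)\n",
    "E0 = rng.normal(scale=0.01, size=(5, 4))\n",
    "E = lightgcn_propagate(R, E0)\n",
    "user_emb, item_emb = E[:2], E[2:]\n",
    "# Predicted preference is the inner product of user and item embeddings.\n",
    "scores = user_emb @ item_emb.T\n",
    "print(scores.shape)"
   ]
  },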
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append(\"../../../\")\n",
    "import os\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "from reco_utils.common.timer import Timer\n",
    "from reco_utils.recommender.deeprec.models.graphrec.lightgcn import LightGCN\n",
    "from reco_utils.recommender.deeprec.DataModel.ImplicitCF import ImplicitCF\n",
    "from reco_utils.dataset import movielens\n",
    "from reco_utils.dataset.python_splitters import python_stratified_split\n",
    "from reco_utils.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k\n",
    "from reco_utils.common.constants import SEED as DEFAULT_SEED\n",
    "from reco_utils.recommender.deeprec.deeprec_utils import prepare_hparams\n",
    "from reco_utils.recommender.deeprec.deeprec_utils import cal_metric\n",
    "from utils.general import *\n",
    "from utils.data_helper import *\n",
    "from utils.task_helper import *\n",
    "\n",
    "tf.logging.set_verbosity(tf.logging.ERROR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "tag = 'small'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "lightgcn_dir = 'data_folder/my/LightGCN-training-folder'\n",
    "rawdata_dir = 'data_folder/my/DKN-training-folder'\n",
    "create_dir(lightgcn_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need to transform the raw dataset into LightGCN's input data format:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "load_instance_file: train_small.txt   done.\n",
      "load_instance_file: valid_small.txt   done.\n",
      "load_instance_file: test_small.txt   done.\n"
     ]
    }
   ],
   "source": [
    "prepare_dataset(lightgcn_dir, rawdata_dir, tag)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train = pd.read_csv(\n",
    "        os.path.join(lightgcn_dir, 'lightgcn_train_{0}.txt'.format(tag)),\n",
    "        sep=' ',\n",
    "        engine=\"python\",\n",
    "        names=['userID', 'itemID', 'rating'],\n",
    "        header=0\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userID</th>\n",
       "      <th>itemID</th>\n",
       "      <th>rating</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2556758139</td>\n",
       "      <td>1639559569</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2556758139</td>\n",
       "      <td>2750948673</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2556758139</td>\n",
       "      <td>3009232636</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2556758139</td>\n",
       "      <td>1997686688</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2630447844</td>\n",
       "      <td>2253252279</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       userID      itemID  rating\n",
       "0  2556758139  1639559569       0\n",
       "1  2556758139  2750948673       0\n",
       "2  2556758139  3009232636       0\n",
       "3  2556758139  1997686688       0\n",
       "4  2630447844  2253252279       1"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LightGCN only takes positive user-item interactions for model training. Pairs with rating < 1 will be ignored by the model."
   ]
  },
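  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of the rule above, the cell below filters a hypothetical toy table with the same `userID`, `itemID`, `rating` layout down to its positive interactions. The `toy` and `positives` names are made up for this sketch; the filtering that matters for training happens inside the library and is not reproduced here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy interactions in the (userID, itemID, rating) layout.\n",
    "toy = pd.DataFrame({\n",
    "    'userID': [1, 1, 2, 2],\n",
    "    'itemID': [10, 11, 10, 12],\n",
    "    'rating': [0, 1, 1, 0],\n",
    "})\n",
    "# Keep only positive interactions (rating >= 1); rows with rating < 1 are dropped.\n",
    "positives = toy[toy['rating'] >= 1].reset_index(drop=True)\n",
    "print(positives)"
   ]
  },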
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_valid = pd.read_csv(\n",
    "        os.path.join(lightgcn_dir, 'lightgcn_valid_{0}.txt'.format(tag)),\n",
    "        sep=' ',\n",
    "        engine=\"python\",\n",
    "        names=['userID', 'itemID', 'rating'],\n",
    "        header=0\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = ImplicitCF(\n",
    "    train=df_train, test=df_valid, seed=0,\n",
    "    col_user='userID',\n",
    "    col_item='itemID',\n",
    "    col_rating='rating'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "yaml_file = './lightgcn.yaml'\n",
    "\n",
    "\n",
    "hparams = prepare_hparams(yaml_file,                          \n",
    "                          learning_rate=0.005,\n",
    "                          eval_epoch=1,\n",
    "                          top_k=10,\n",
    "                          save_model=True,\n",
    "                          epochs=15,\n",
    "                          save_epoch=1\n",
    "                         )\n",
    "hparams.MODEL_DIR = os.path.join(lightgcn_dir, 'saved_models')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<bound method HParams.values of HParams([('DNN_FIELD_NUM', None), ('EARLY_STOP', 100), ('FEATURE_COUNT', None), ('FIELD_COUNT', None), ('L', None), ('MODEL_DIR', 'data_folder/my/LightGCN-training-folder/saved_models'), ('PAIR_NUM', None), ('SUMMARIES_DIR', None), ('T', None), ('activation', None), ('att_fcn_layer_sizes', None), ('attention_activation', None), ('attention_dropout', 0.0), ('attention_layer_sizes', None), ('attention_size', None), ('batch_size', 1024), ('cate_embedding_dim', None), ('cate_vocab', None), ('contextEmb_file', None), ('cross_activation', 'identity'), ('cross_l1', 0.0), ('cross_l2', 0.0), ('cross_layer_sizes', None), ('cross_layers', None), ('data_format', None), ('decay', 0.0001), ('dilations', None), ('dim', None), ('doc_size', None), ('dropout', [0.0]), ('dtype', 32), ('embed_l1', 0.0), ('embed_l2', 0.0), ('embed_size', 64), ('embedding_dropout', 0.3), ('enable_BN', False), ('entityEmb_file', None), ('entity_dim', None), ('entity_embedding_method', None), ('entity_size', None), ('epochs', 15), ('eval_epoch', 1), ('fast_CIN_d', 0), ('filter_sizes', None), ('hidden_size', None), ('history_size', None), ('init_method', 'tnormal'), ('init_value', 0.01), ('is_clip_norm', 0), ('item_embedding_dim', None), ('item_vocab', None), ('iterator_type', None), ('kernel_size', None), ('kg_file', None), ('kg_training_interval', 5), ('layer_l1', 0.0), ('layer_l2', 0.0), ('layer_sizes', None), ('learning_rate', 0.005), ('load_model_name', None), ('load_saved_model', False), ('loss', None), ('lr_kg', 0.5), ('lr_rs', 1), ('max_grad_norm', 2), ('max_seq_length', None), ('method', None), ('metrics', ['recall', 'ndcg', 'precision', 'map']), ('min_seq_length', 1), ('model_type', 'lightgcn'), ('mu', None), ('n_h', None), ('n_item', None), ('n_item_attr', None), ('n_layers', 3), ('n_user', None), ('n_user_attr', None), ('n_v', None), ('need_sample', True), ('news_feature_file', None), ('num_filters', None), ('optimizer', 'adam'), ('pairwise_metrics', None), ('reg_kg', 0.0), ('save_epoch', 1), ('save_model', True), ('show_step', 1), ('top_k', 10), ('train_num_ngs', 4), ('train_ratio', None), ('transform', None), ('use_CIN_part', False), ('use_DNN_part', False), ('use_FM_part', False), ('use_Linear_part', False), ('use_context', True), ('use_entity', True), ('user_clicks', None), ('user_dropout', False), ('user_embedding_dim', None), ('user_history_file', None), ('user_vocab', None), ('wordEmb_file', None), ('word_size', None), ('write_tfevents', False)])>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hparams.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Already create adjacency matrix.\n",
      "Already normalize adjacency matrix.\n",
      "Using xavier initialization.\n"
     ]
    }
   ],
   "source": [
    "model = LightGCN(hparams, data, seed=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_1\n",
      "Epoch 1 (train)13.8s + (eval)1.2s: train loss = 0.08667 = (mf)0.08563 + (embed)0.00104, recall = 0.18498, ndcg = 0.09494, precision = 0.01850, map = 0.06812\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_2\n",
      "Epoch 2 (train)12.8s + (eval)1.1s: train loss = 0.01980 = (mf)0.01793 + (embed)0.00187, recall = 0.22820, ndcg = 0.12585, precision = 0.02282, map = 0.09494\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_3\n",
      "Epoch 3 (train)12.8s + (eval)1.1s: train loss = 0.01252 = (mf)0.01021 + (embed)0.00231, recall = 0.25020, ndcg = 0.13682, precision = 0.02502, map = 0.10265\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_4\n",
      "Epoch 4 (train)12.8s + (eval)1.1s: train loss = 0.00932 = (mf)0.00676 + (embed)0.00256, recall = 0.26738, ndcg = 0.14560, precision = 0.02674, map = 0.10878\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_5\n",
      "Epoch 5 (train)12.8s + (eval)1.0s: train loss = 0.00768 = (mf)0.00498 + (embed)0.00270, recall = 0.27402, ndcg = 0.15165, precision = 0.02740, map = 0.11438\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_6\n",
      "Epoch 6 (train)12.8s + (eval)1.0s: train loss = 0.00665 = (mf)0.00390 + (embed)0.00275, recall = 0.27740, ndcg = 0.15228, precision = 0.02774, map = 0.11405\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_7\n",
      "Epoch 7 (train)12.7s + (eval)1.0s: train loss = 0.00598 = (mf)0.00324 + (embed)0.00273, recall = 0.28547, ndcg = 0.14945, precision = 0.02855, map = 0.10802\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_8\n",
      "Epoch 8 (train)12.8s + (eval)1.0s: train loss = 0.00537 = (mf)0.00268 + (embed)0.00268, recall = 0.29524, ndcg = 0.15881, precision = 0.02952, map = 0.11722\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_9\n",
      "Epoch 9 (train)12.8s + (eval)1.0s: train loss = 0.00500 = (mf)0.00239 + (embed)0.00261, recall = 0.29719, ndcg = 0.15873, precision = 0.02972, map = 0.11644\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_10\n",
      "Epoch 10 (train)12.9s + (eval)1.1s: train loss = 0.00467 = (mf)0.00213 + (embed)0.00254, recall = 0.29524, ndcg = 0.15681, precision = 0.02952, map = 0.11459\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_11\n",
      "Epoch 11 (train)12.8s + (eval)1.0s: train loss = 0.00444 = (mf)0.00197 + (embed)0.00247, recall = 0.30630, ndcg = 0.16187, precision = 0.03063, map = 0.11790\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_12\n",
      "Epoch 12 (train)13.0s + (eval)1.1s: train loss = 0.00420 = (mf)0.00180 + (embed)0.00240, recall = 0.30617, ndcg = 0.16130, precision = 0.03062, map = 0.11739\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_13\n",
      "Epoch 13 (train)13.0s + (eval)1.1s: train loss = 0.00397 = (mf)0.00162 + (embed)0.00235, recall = 0.30565, ndcg = 0.16252, precision = 0.03056, map = 0.11902\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_14\n",
      "Epoch 14 (train)12.9s + (eval)1.1s: train loss = 0.00380 = (mf)0.00150 + (embed)0.00230, recall = 0.30851, ndcg = 0.16434, precision = 0.03085, map = 0.12043\n",
      "Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_15\n",
      "Epoch 15 (train)12.9s + (eval)1.1s: train loss = 0.00366 = (mf)0.00140 + (embed)0.00226, recall = 0.31567, ndcg = 0.16733, precision = 0.03157, map = 0.12219\n",
      "Took 210.9439941626042 seconds for training.\n"
     ]
    }
   ],
   "source": [
    "with Timer() as train_time:\n",
    "    model.fit()\n",
    "\n",
    "print(\"Took {} seconds for training.\".format(train_time.interval))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "user_emb_file = os.path.join(lightgcn_dir, 'user.emb.txt')\n",
    "item_emb_file = os.path.join(lightgcn_dir, 'item.emb.txt')\n",
    "model.infer_embedding(\n",
    "    user_emb_file,\n",
    "    item_emb_file    \n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To compare LightGCN's performance with DKN, we need to make predictions on the same test set. So we infer the user and item embeddings, then compute a similarity score (the dot product of the two embeddings) for each user-item pair in the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def infer_scores_via_embeddings(test_filename, user_emb_file, item_emb_file):\n",
    "    # Score every test line as the dot product of its user and item embeddings.\n",
    "    print('loading embedding file...', end=' ')\n",
    "    user2vec = load_emb_file(user_emb_file)\n",
    "    item2vec = load_emb_file(item_emb_file)\n",
    "    preds, labels, groupids = [], [], []\n",
    "    with open(test_filename, 'r') as rd:\n",
    "        for line in rd:\n",
    "            # Before '%': space-separated tokens (token 0 is the label, token 2 the item ID).\n",
    "            # After '%': the user ID, which also serves as the group key.\n",
    "            words = line.strip().split('%')\n",
    "            tokens = words[0].split(' ')\n",
    "            userid = words[1]\n",
    "            itemid = tokens[2]\n",
    "            pred = user2vec[userid].dot(item2vec[itemid])\n",
    "            preds.append(pred)\n",
    "            labels.append(int(tokens[0]))\n",
    "            groupids.append(userid)\n",
    "    print('done')\n",
    "    return labels, preds, groupids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "loading embedding file... done\n"
     ]
    }
   ],
   "source": [
    "test_filename = os.path.join(rawdata_dir, 'test_{}.txt'.format(tag))\n",
    "labels, preds, group_keys = infer_scores_via_embeddings(test_filename, user_emb_file, item_emb_file)\n",
    "# Use new names so the imported group_labels() helper is not overwritten when this cell is re-run.\n",
    "grouped_labels, grouped_preds = group_labels(labels, preds, group_keys)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'ndcg@2': 0.4026, 'ndcg@4': 0.4953, 'ndcg@6': 0.5346, 'group_auc': 0.8096}\n",
      "{'auc': 0.8092}\n"
     ]
    }
   ],
   "source": [
    "# Per-user (grouped) ranking metrics.\n",
    "res_pairwise = cal_metric(\n",
    "    grouped_labels, grouped_preds, ['ndcg@2;4;6', 'group_auc']\n",
    ")\n",
    "print(res_pairwise)\n",
    "# Global AUC over all test pairs.\n",
    "res_pointwise = cal_metric(labels, preds, ['auc'])\n",
    "print(res_pointwise)"
   ]
  },
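  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the `group_auc` number above easier to interpret, the cell below sketches one common definition: compute AUC separately within each user's group of test impressions, then average over groups. It is a toy NumPy illustration with made-up labels and scores. `pairwise_auc` and `group_auc_sketch` are hypothetical helpers, and we assume, rather than confirm here, that `cal_metric` follows the same definition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def pairwise_auc(y_true, y_score):\n",
    "    # AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5.\n",
    "    y_true = np.asarray(y_true)\n",
    "    y_score = np.asarray(y_score, dtype=float)\n",
    "    pos, neg = y_score[y_true == 1], y_score[y_true == 0]\n",
    "    diff = pos[:, None] - neg[None, :]\n",
    "    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())\n",
    "\n",
    "def group_auc_sketch(labels_by_group, preds_by_group):\n",
    "    # Average the per-group AUC; each group needs at least one positive and one negative.\n",
    "    aucs = [pairwise_auc(l, p) for l, p in zip(labels_by_group, preds_by_group)]\n",
    "    return float(np.mean(aucs))\n",
    "\n",
    "# Made-up example: the first user is ranked perfectly, the second is ranked in reverse.\n",
    "toy_labels = [[1, 0, 0], [0, 1]]\n",
    "toy_preds = [[0.9, 0.2, 0.4], [0.7, 0.3]]\n",
    "print(group_auc_sketch(toy_labels, toy_preds))"
   ]
  },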
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reference: \n",
    "1. Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang & Meng Wang, LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation, 2020, https://arxiv.org/abs/2002.02126"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (reco_gpu)",
   "language": "python",
   "name": "reco_gpu"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
